This is a high-level overview only. For details, see:

Pattern Recognition and Machine Learning, Christopher Bishop, Springer-Verlag, 2006 
Or 

Pattern Classification by R. O. Duda, P. E. Hart, D. Stork, Wiley and Sons. 


Thomas Bayes 
1702 - 1761 


We will start off with a visual intuition, before looking at the math 




[Figure: Grasshoppers and Katydids, with Abdomen Length on one axis.]

Remember this example? 
Let’s get lots more data... 
With a lot of data, we can build a histogram. Let us
just build one for “Antenna Length” for now...



[Figure: histogram of Antenna Length for Katydids and Grasshoppers.]




We can leave the 
histograms as they are 
or we can summarize 
them with two normal 
distributions. 


Let us use two normal
distributions for ease 
of visualization in the 
following slides... 



• We want to classify an insect we have found. Its antennae are 3 units long. 
How can we classify it? 


• We can just ask ourselves: given the distributions of antennae lengths we have
seen, is it more probable that our insect is a Grasshopper or a Katydid?

• There is a formal way to discuss the most probable classification... 

p(c_j | d) = probability of class c_j, given that we have observed d
















P(Grasshopper | 3) = 10 / (10 + 2) = 0.833
P(Katydid | 3) = 2 / (10 + 2) = 0.167



Antennae length is 3 



















P(Grasshopper | 7) = 3 / (3 + 9) = 0.250
P(Katydid | 7) = 9 / (3 + 9) = 0.750



Antennae length is 7 





























P(Grasshopper | 5) = 6 / (6 + 6) = 0.500
P(Katydid | 5) = 6 / (6 + 6) = 0.500



Antennae length is 5 
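The three cases above all use the same recipe: count how many insects of each class were seen at that antenna length, then divide by the total. A minimal Python sketch of that recipe follows. The counts are the ones quoted on the slides; the names `counts` and `class_probabilities` are illustrative assumptions, not part of the original material.

```python
# Counts read off the two histograms at each antenna length,
# as quoted on the slides above.
counts = {
    3: {"Grasshopper": 10, "Katydid": 2},
    5: {"Grasshopper": 6, "Katydid": 6},
    7: {"Grasshopper": 3, "Katydid": 9},
}


def class_probabilities(length):
    """Return P(class | antenna length) as each class's share at that length."""
    at_length = counts[length]
    total = sum(at_length.values())
    return {insect: n / total for insect, n in at_length.items()}


for length in sorted(counts):
    probs = class_probabilities(length)
    # On a tie (length 5) max() simply returns the first class listed.
    best = max(probs, key=probs.get)
    print(length, {k: round(v, 3) for k, v in probs.items()}, "->", best)
```

At length 5 the two classes are equally probable, so any classifier has to break the tie arbitrarily.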




















Bayes Classifiers 


That was a visual intuition for a simple case of the Bayes classifier, 
also called: 

• Idiot Bayes 

• Naive Bayes 

• Simple Bayes 

We are about to see some of the mathematical formalisms, and 
more examples, but keep in mind the basic idea. 

Find out the probability of the previously unseen instance 
belonging to each class, then simply pick the most probable class. 


Bayes Classifiers 

Bayesian classifiers use Bayes' theorem, which says

$$p(c_j \mid d) = \frac{p(d \mid c_j)\, p(c_j)}{p(d)}$$

• p(c_j | d) = probability of instance d being in class c_j.
  This is what we are trying to compute.

• p(d | c_j) = probability of generating instance d given class c_j.
  We can imagine that being in class c_j causes you to have feature d
  with some probability.

• p(c_j) = probability of occurrence of class c_j.
  This is just how frequent the class c_j is in our database.

• p(d) = probability of instance d occurring.
  This can actually be ignored, since it is the same for all classes.
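To make the role of p(d) concrete, here is a small Python sketch of Bayes' theorem applied to every class at once. The function name `posteriors` and the numbers are made up for illustration only. It assumes the classes are mutually exclusive and exhaustive, so that p(d) is the sum of p(d | c_j) p(c_j) over all classes.

```python
def posteriors(likelihoods, priors):
    """Apply Bayes' theorem to every class.

    likelihoods[c] is p(d | c) and priors[c] is p(c).
    Returns (normalized posteriors, unnormalized scores).
    """
    unnormalized = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(unnormalized.values())  # p(d), summed over all classes
    normalized = {c: score / evidence for c, score in unnormalized.items()}
    return normalized, unnormalized


# Made-up numbers for two hypothetical classes "a" and "b".
likelihoods = {"a": 0.2, "b": 0.5}
priors = {"a": 0.7, "b": 0.3}
post, scores = posteriors(likelihoods, priors)

# Dividing by p(d) rescales every class equally, so the winner is unchanged.
best_normalized = max(post, key=post.get)
best_unnormalized = max(scores, key=scores.get)
print(post, best_normalized, best_unnormalized)
```

Because the same p(d) divides every class score, picking the largest unnormalized score gives the same classification as picking the largest posterior.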



Assume that we have two classes:
c_1 = male and c_2 = female.

We have a person whose sex we do not
know, say “drew” or d.

Classifying drew as male or female is
equivalent to asking which is more probable,
i.e., which is greater:
p(male | drew) or p(female | drew)?

(Note: “Drew” can be a male or female name.)



Drew Carey 

$$p(\text{male} \mid \text{drew}) = \frac{p(\text{drew} \mid \text{male})\, p(\text{male})}{p(\text{drew})}$$

• p(drew | male): What is the probability of being called
  “drew” given that you are a male?

• p(male): What is the probability of being a male?

• p(drew): What is the probability of being named “drew”?








This is Officer Drew (who arrested me in 
1997). Is Officer Drew a Male or Female? 

Luckily, we have a small 
database with names and sex. 

We can use it to apply Bayes 
rule... 


$$p(c_j \mid d) = \frac{p(d \mid c_j)\, p(c_j)}{p(d)}$$


Name      Sex
Drew      Male
Claudia   Female
Drew      Female
Drew      Female
Alberto   Male
Karin     Female
Nina      Female
Sergio    Male



Officer Drew 
















$$p(c_j \mid d) = \frac{p(d \mid c_j)\, p(c_j)}{p(d)}$$

p(male | drew) = (1/3 * 3/8) / (3/8) = 0.125 / (3/8) = 0.333

p(female | drew) = (2/5 * 5/8) / (3/8) = 0.250 / (3/8) = 0.667




Officer Drew is 
more likely to be 
a Female. 
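The same calculation can be reproduced directly from the eight-row name/sex database. The sketch below is one way to do it in Python; the names `database` and `posterior` are illustrative assumptions.

```python
from collections import Counter

# The small name/sex database from the slide.
database = [
    ("Drew", "Male"), ("Claudia", "Female"), ("Drew", "Female"),
    ("Drew", "Female"), ("Alberto", "Male"), ("Karin", "Female"),
    ("Nina", "Female"), ("Sergio", "Male"),
]


def posterior(name, sex):
    """p(sex | name) = p(name | sex) * p(sex) / p(name), all from counts."""
    n = len(database)
    sex_counts = Counter(s for _, s in database)
    name_count = sum(1 for nm, _ in database if nm == name)
    joint_count = sum(1 for nm, s in database if nm == name and s == sex)
    likelihood = joint_count / sex_counts[sex]  # p(name | sex)
    prior = sex_counts[sex] / n                 # p(sex)
    evidence = name_count / n                   # p(name)
    return likelihood * prior / evidence


p_male = posterior("Drew", "Male")      # (1/3 * 3/8) / (3/8)
p_female = posterior("Drew", "Female")  # (2/5 * 5/8) / (3/8)
print(round(p_male, 3), round(p_female, 3))
```

The numerators are 0.125 and 0.250 as on the slide; after dividing by p(drew) = 3/8 the two posteriors sum to 1.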


























Officer Drew IS a female! 
















So far we have only considered Bayes
classification when we have one
attribute (the “antennae length”, or the
“name”). But we may have many
features.

How do we use all the features? 


$$p(c_j \mid d) = \frac{p(d \mid c_j)\, p(c_j)}{p(d)}$$


Name      Over 170cm   Eye     Hair length   Sex
Drew      No           Blue    Short         Male
Claudia   Yes          Brown   Long          Female
Drew      No           Blue    Long          Female
Drew      No           Blue    Long          Female
Alberto   Yes          Brown   Short         Male
Karin     No           Blue    Long          Female
Nina      Yes          Brown   Short         Female
Sergio    Yes          Blue    Long          Male
















To simplify the task, naive Bayesian classifiers assume
attributes have independent distributions, and thereby estimate

$$p(d \mid c_j) = p(d_1 \mid c_j) \cdot p(d_2 \mid c_j) \cdots$$

• The probability of class c_j generating instance d equals...

• the probability of class c_j generating the observed
  value for feature 1, multiplied by...

• the probability of class c_j generating the observed
  value for feature 2, multiplied by...




Written out for all n features, the estimate is

$$p(d \mid c_j) = p(d_1 \mid c_j) \cdot p(d_2 \mid c_j) \cdots p(d_n \mid c_j)$$

For Officer Drew:

$$p(\text{officer drew} \mid c_j) = p(\text{over\_170cm} = \text{yes} \mid c_j) \cdot p(\text{eye} = \text{blue} \mid c_j) \cdots$$



Officer Drew 
is blue-eyed, 
over 170 cm 
tall, and has 
long hair 


p(officer drew | Female) = 2/5 * 3/5 * ...
p(officer drew | Male) = 2/3 * 2/3 * ...
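The products above can be computed from the five-column table by counting within each class. The Python sketch below does this with exact fractions. The first two factors match the slide; the hair-length factor is not shown on the slide and is computed here from the table, so treat it as my completion. The names `rows`, `FEATURES`, and `likelihood` are illustrative.

```python
from fractions import Fraction

# Columns: name, over_170cm, eye, hair, sex (the table from the slide).
rows = [
    ("Drew", "No", "Blue", "Short", "Male"),
    ("Claudia", "Yes", "Brown", "Long", "Female"),
    ("Drew", "No", "Blue", "Long", "Female"),
    ("Drew", "No", "Blue", "Long", "Female"),
    ("Alberto", "Yes", "Brown", "Short", "Male"),
    ("Karin", "No", "Blue", "Long", "Female"),
    ("Nina", "Yes", "Brown", "Short", "Female"),
    ("Sergio", "Yes", "Blue", "Long", "Male"),
]
FEATURES = {"over_170cm": 1, "eye": 2, "hair": 3}  # column index of each feature


def likelihood(observed, sex):
    """Naive Bayes estimate of p(observed | sex): a product of per-feature shares."""
    in_class = [r for r in rows if r[4] == sex]
    result = Fraction(1)
    for feature, value in observed.items():
        col = FEATURES[feature]
        matches = sum(1 for r in in_class if r[col] == value)
        result *= Fraction(matches, len(in_class))
    return result


officer_drew = {"over_170cm": "Yes", "eye": "Blue", "hair": "Long"}
for sex in ("Female", "Male"):
    print(sex, likelihood(officer_drew, sex))
```

To turn these likelihoods into a classification, each would still be multiplied by the class prior (5/8 for Female, 3/8 for Male in this table).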


The naive Bayes classifier
is often represented as this
type of graph...

Note the direction of the
arrows, which state that
each class causes certain
features, with a certain
probability.












Naive Bayes is fast and 
space efficient 


We can look up all the probabilities 
with a single scan of the database and 
store them in a (small) table... 



Tables of p(d_i | c_j), one per feature:

Sex      Over 190 cm   Probability
Male     Yes           0.15
         No            0.85
Female   Yes           0.01
         No            0.99

Sex      Long Hair     Probability
Male     Yes           0.05
         No            0.95
Female   Yes           0.70
         No            0.30



Sex 


Male 


Female 
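The "single scan" idea can be sketched in Python as follows: one pass over the records updates class counts and per-class value counts, and the probability tables fall out by division. The function `build_tables` and the four toy records are hypothetical, made up only to show the mechanics.

```python
from collections import defaultdict


def build_tables(records, class_key):
    """One pass over the records: count classes and (class, feature, value) triples."""
    class_counts = defaultdict(int)
    value_counts = defaultdict(int)  # key: (class, feature, value)
    for record in records:
        label = record[class_key]
        class_counts[label] += 1
        for feature, value in record.items():
            if feature != class_key:
                value_counts[(label, feature, value)] += 1
    # Convert counts to conditional probabilities p(feature = value | class).
    tables = {
        key: count / class_counts[key[0]] for key, count in value_counts.items()
    }
    return dict(class_counts), tables


# Made-up toy records, not data from the slides.
records = [
    {"sex": "Male", "long_hair": "No"},
    {"sex": "Male", "long_hair": "No"},
    {"sex": "Male", "long_hair": "Yes"},
    {"sex": "Female", "long_hair": "Yes"},
]
class_counts, tables = build_tables(records, "sex")
print(class_counts)
print(tables)
```

The stored table has one entry per (class, feature, value) combination that was seen, which is why it stays small even when the database is large.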




















Naive Bayes is NOT sensitive to irrelevant features...

Suppose we are trying to classify a person's sex based on
several features, including eye color. (Of course, eye color
is completely irrelevant to a person's gender.)

p(Jessica | c_j) = p(eye = brown | c_j) * p(wears_dress = yes | c_j) * ...

p(Jessica | Female) = 9,000/10,000 * 9,975/10,000 * ...

p(Jessica | Male) = 9,001/10,000 * 2/10,000 * ...

The eye-color factors (9,000/10,000 and 9,001/10,000) are almost the same,
so they barely affect the comparison.

However, this assumes that we have good enough estimates of
the probabilities, so the more data the better.
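A quick numerical check of this claim, as a Python sketch using only the two factors quoted on the slide (the remaining factors are elided there and are left out here). The names `p_eye_brown`, `p_wears_dress`, and `ratio` are illustrative.

```python
# Per-class feature probabilities quoted on the slide for "Jessica".
p_eye_brown = {"Female": 9_000 / 10_000, "Male": 9_001 / 10_000}
p_wears_dress = {"Female": 9_975 / 10_000, "Male": 2 / 10_000}


def ratio(*feature_tables):
    """Female-to-male likelihood ratio using the given feature tables."""
    female = male = 1.0
    for table in feature_tables:
        female *= table["Female"]
        male *= table["Male"]
    return female / male


without_eye = ratio(p_wears_dress)
with_eye = ratio(p_eye_brown, p_wears_dress)
print(without_eye, with_eye, with_eye / without_eye)
```

Including eye color multiplies the ratio by 9,000/9,001, a change of about one part in nine thousand, so the decision is driven by the informative feature.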


An obvious point: I have used a
simple two-class problem, with
two possible values for each
feature, in my previous
examples. However, we can have
an arbitrary number of classes or
feature values.



Animal   Mass > 10 kg   Probability
Cat      Yes            0.15
         No             0.85
Dog      Yes            0.91
         No             0.09
Pig      Yes            0.99
         No             0.01

Animal   Color   Probability
Cat      Black   0.33
         White   0.23
         Brown   0.44
Dog      Black   0.97
         White   0.03
         Brown   0.90
Pig      Black   0.04
         White   0.01
































Sex      Over 6 foot   Probability
Male     Yes           0.15
         No            0.85
Female   Yes           0.01
         No            0.99

Sex      Over 200 pounds   Probability
Male     Yes               0.11
         No                0.80
Female   Yes               0.05
         No                0.95

[Figure: graph nodes p(d_1 | c_j) and p(d_2 | c_j).]


Sex      Over 6 foot   Probability
Male     Yes           0.15
         No            0.85
Female   Yes           0.01
         No            0.99

Sex      Over 200 pounds             Probability
Male     Yes and Over 6 foot         0.11
         No and Over 6 foot          0.59
         Yes and NOT Over 6 foot     0.05
         No and NOT Over 6 foot      0.35
...      Yes and Over 6 foot         ...

[Figure: graph nodes p(d_1 | c_j) and p(d_2 | c_j), now connected to each other.]


But how do we find the set of connecting arcs?? 



The Naive Bayesian Classifier has a piecewise quadratic decision boundary 



Adapted from slide by Ricardo Gutierrez-Osuna 
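One way to see where the quadratic shape comes from, as a sketch under the assumption that each feature is modeled with a normal distribution per class. The symbols g_j, \mu_{ij}, and \sigma_{ij} (mean and standard deviation of feature i in class c_j) are introduced here for illustration. Taking the log of p(d | c_j) p(c_j) with the naive product gives

$$g_j(d) = \ln p(c_j) - \sum_{i=1}^{n} \left[ \ln\!\left(\sqrt{2\pi}\,\sigma_{ij}\right) + \frac{(d_i - \mu_{ij})^2}{2\sigma_{ij}^2} \right]$$

The boundary between two classes c_j and c_k is the set of points where g_j(d) = g_k(d). Each side is a quadratic function of the feature values d_i, so the boundary is a quadratic surface. With more than two classes, the overall boundary is stitched together from such pairwise surfaces, which is what makes it piecewise quadratic. If \sigma_{ij} = \sigma_{ik} for every feature, the squared terms cancel and that piece of the boundary becomes linear.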















[Figure: One second of audio from the laser sensor. Only
Bombus impatiens (Common Eastern Bumble
Bee) is in the insectary. Annotations: "Background noise" and
"Bee begins to cross laser". Amplitude axis from -0.2 to 0.2.]

[Figure: spectrum |Y(f)| of the signal.]

[Figure: histograms over Wing Beat Frequency (Hz), axis from 400 to 700.]


Anopheles stephensi (female): mean = 475, std = 30

Aedes aegypti (female): mean = 567, std = 43


If I see an insect with a wingbeat frequency of 500, what is it?

$$P(\text{Anopheles} \mid \text{wingbeat} = 500) = \frac{1}{\sqrt{2\pi}\,\cdot 30}\, e^{-\frac{(500 - 475)^2}{2 \times 30^2}}$$
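The normal density above can be evaluated for both species and compared. The Python sketch below uses the means and standard deviations quoted on the slide. It assumes equal priors for the two species, so comparing densities is enough; `normal_pdf` and `species` are illustrative names.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    exponent = -((x - mean) ** 2) / (2 * std ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * std)


# (mean, std) of wingbeat frequency in Hz, as quoted on the slide.
species = {"Anopheles stephensi": (475, 30), "Aedes aegypti": (567, 43)}

densities = {name: normal_pdf(500, m, s) for name, (m, s) in species.items()}
best = max(densities, key=densities.get)
print(densities, "->", best)
```

At 500 Hz the Anopheles density is roughly three times the Aedes density, so under equal priors the insect is classified as Anopheles.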











[Figure: the two fitted normal curves over Wing Beat Frequency (Hz),
axis from 400 to 700, with 517 marked.]

What is the error rate?

12.2% of the area under the pink curve

8.02% of the area under the red curve

Can we get more features?








Circadian Features 













Suppose I observe an 
insect with a wingbeat 
frequency of 420Hz 


What is it? 





Suppose I observe an 
insect with a wingbeat 
frequency of 420Hz at 

11:00am 


What is it? 


[Figure: class distributions over wingbeat frequency (axis from 400 to 600)
and over time of day (axis from 0, midnight, to 24, midnight).]














P(Culex | [420Hz, 11:00am]) = (6 / (6 + 6 + 0)) * (2 / (2 + 4 + 3)) = 0.111
P(Anopheles | [420Hz, 11:00am]) = (6 / (6 + 6 + 0)) * (4 / (2 + 4 + 3)) = 0.222
P(Aedes | [420Hz, 11:00am]) = (0 / (6 + 6 + 0)) * (3 / (2 + 4 + 3)) = 0.000
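The three products above follow one pattern: for each feature, take the species' share of the counts at the observed value, then multiply the shares. A Python sketch of that pattern, using the counts quoted on the slide (the names `wingbeat_counts`, `time_counts`, and `share` are illustrative):

```python
# Counts read off the two plots at 420 Hz and at 11:00am, per the slide.
wingbeat_counts = {"Culex": 6, "Anopheles": 6, "Aedes": 0}
time_counts = {"Culex": 2, "Anopheles": 4, "Aedes": 3}


def share(counts, species):
    """Fraction of the observations at this feature value belonging to the species."""
    return counts[species] / sum(counts.values())


scores = {
    s: share(wingbeat_counts, s) * share(time_counts, s) for s in wingbeat_counts
}
best = max(scores, key=scores.get)
print({s: round(v, 3) for s, v in scores.items()}, "->", best)
```

Wingbeat alone cannot separate Culex from Anopheles at 420 Hz (6 versus 6), but the time of day tips the decision toward Anopheles.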











Which of the “Pigeon Problems” can be 
solved by a decision tree? 





































Advantages/Disadvantages of Naive Bayes 

• Advantages: 

- Fast to train (single scan). Fast to classify 

- Not sensitive to irrelevant features 

- Handles real and discrete data 

- Handles streaming data well 

• Disadvantages: 

- Assumes independence of features 



